Abstract

In data mining, it is usual to describe a set of individuals using some
summaries (means, standard deviations, histograms, confidence intervals)
that generalize individual descriptions into a typology description. In this
case, data can be described by several values. In this paper, we propose an
approach for computing basic statistics for such data, and, in particular, for
data described by numerical multi-valued variables (intervals, histograms,
discrete multi-valued descriptions). We propose to treat all numerical
multi-valued variables as distributional data, i.e. as individuals described
by distributions. To obtain new basic statistics for measuring the
variability and the association between such variables, we extend the classic
measure of inertia, calculated with the Euclidean distance, using the
squared Wasserstein distance defined between probability measures. The
distance is a generalization of the Wasserstein distance, that is, a distance
between the quantile functions of two distributions. Some properties of such
a distance are shown. Among them, we prove the Huygens theorem of
decomposition of the inertia. We illustrate the use of the Wasserstein
distance and of the basic statistics by presenting a k-means-like clustering
algorithm for a set of data described by modal numerical variables
(distributional variables), applied to a real data set.

Keywords: Wasserstein distance, inertia, dependence, distributional data,
modal variables.

1 Introduction 

In many real-world applications, data are collected and/or represented by
multi-valued descriptions: intervals, frequency distributions, histograms,
density distributions, and so on. Several approaches have been presented in
the literature for processing such data. In particular, when data take
multiple values in subsets of $\mathbb{R}$, they can be analysed using an
interval arithmetic approach [20], a fuzzy set approach, or a Symbolic Data
Analysis approach [?]. When the data domain is categorical, one of the most
interesting approaches is the compositional data one. Without loss of
generality, when necessary, we refer to the Symbolic Data Analysis approach,
which can be considered as a generalization of the other cited ones. Indeed,
it allows processing interval, multi-valued discrete, multi-categorical,
histogram and modal descriptors [?, ?, ?]. The last ones can model the
description of an individual, or of a concept, by a distribution of
probabilities or frequencies or, in general, by random variables.

In recent years, several authors have proposed new statistics and new
techniques for the analysis of a particular case of modal data description:
histogram-valued data. An interesting approach is due to Billard and Diday
[?]. They define new elementary statistics, association measures and a linear
regression technique for the analysis of this kind of data. Further, Irpino
et al. [15] proposed an extension of the dynamic clustering algorithm and of
hierarchical clustering for data described by histograms. Data can be
described by several variables in a multivariate fashion. The first problem
to solve in the analysis of multivariate data is the standardization of such
data in order to balance their contribution to the results of the analysis.
Other approaches dealing with the computation of the variability of a set of
complex data can be found in [?], [?] and [?]. Billard [?] and Bertrand and
Goupil [5] apply the concept of variability to interval-valued data,
considering an interval-valued realization on $[a, b]$ as uniformly
distributed, $U(a, b)$. On this basis, Bertrand and Goupil [5] developed some
basic statistics for interval data, and Billard [?] extended them to the
computation of dependence and interdependence measures for interval-valued
data. After presenting multi-valued numerical data in Section 2, we introduce
(Section 3) a method for the computation of the mean of a set of
distributions. The mean of a set of distributions is calculated by minimizing
the sum of the squared Wasserstein distances between each pair of
distributions. The distance can be considered as an extension of the
Euclidean distance between quantile functions. Section 4 shows how to compute
the variance and the standard deviation of a set of distributions, according
to the squared Wasserstein distance. We show that the proposed approach is
consistent with the classical concept of location, variability and shape
measures of distributions.
When data are described by more than one variable, a measure of association
between variables is needed. In Section 5 we propose a way to extend the
classical covariance and correlation measures between two standard variables
to the case of numeric modal variables.

In order to present an application that uses all the proposed statistics, in
Section 7 we present a Dynamic Clustering Algorithm (DCA) [?] (a k-means-like
algorithm) for data described by histograms, where a Mahalanobis version of
the Wasserstein metric [?] is used for the allocation of data to the
clusters. Section 8 ends the paper with some comments and suggestions for
future research.



2 Numerical modal-valued data 

In this paper, we follow the Symbolic Data Analysis (SDA) approach for the
definition of data, introducing some generalizations [1, ?]. SDA aims to
extend classical data analysis and statistical methods to more complex data
called symbolic data. Considering the classical situation with a set of units
$E = \{1, \ldots, i, \ldots, N\}$ and a set of $p$ variables
$y_1, \ldots, y_j, \ldots, y_p$, Bock and Diday [?] define symbolic variables
as follows:

Definition 1. A variable $y$ is termed set-valued with domain $Y$ if, for all
$i \in E$,

\[ y : E \to D, \qquad i \mapsto y(i) \tag{1} \]

where the description set $D$ is defined by
$D = \mathcal{P}(Y) = \{U \neq \emptyset \mid U \subseteq Y\}$. A set-valued
variable $y$ is called multi-valued if its description set $D_c$ is the set
of all finite subsets of the underlying domain $Y$, such that
$|y(i)| < \infty$ for all $i \in E$.

A set-valued variable $y$ is called categorical multi-valued if it has a
finite set $Y$ of categories, and quantitative multi-valued if the values
$y(i)$ are finite sets of real numbers.

A set-valued variable $y$ is called interval-valued if its description set
$D_I$ is the set of intervals of $\mathbb{R}$.

Definition 2. A modal variable $y$ on a set $E$ of objects with domain $Y$ is
a mapping

\[ y(i) = (S(i), \pi_i), \qquad \forall i \in E \tag{2} \]

where $\pi_i$ is a measure or a (frequency, probability or weight)
distribution on the domain $Y$ of possible observation values (completed by a
$\sigma$-field), and $S(i) \subseteq Y$ is the support of $\pi_i$ in the
domain $Y$. The description set is denoted by $D_m$.

In the present paper, we do not treat the multi-categorical case, but only
those descriptions based on a numerical support. Indeed, if the support is
categorical, we are in the presence of a compositional data description [1],
expressed by vectors of nonnegative real components having a constant sum.

We propose to treat all numerical (single-valued or set-valued) descriptions
as particular cases of the modal description. Following Definition 2, we
treat data in a probabilistic perspective, as distributional data. In order
to follow the terminology adopted in SDA, the variables which allow
distributions as descriptions of individuals are termed modal-numeric
descriptors.

Definition 3. Given a set $E$ of objects with domain $Y$ and support $S(i)$
partitioned into $n_i$ subsets $S_h(i)$, and a probability measure $\pi_i$
defining a density function $\psi_i$ and the respective distribution function
$\Psi_i$, such that

\[ \pi_i(S_h(i)) = \Psi_i(y \in S_h(i)) = \int_{S_h(i)} \psi_i(y)\,dy \tag{3} \]

where $h = 1, \ldots, n_i$, a modal variable $y$ is a mapping

\[ y(i) = \{S_h(i), \pi_i(S_h(i))\}, \qquad \forall i \in E. \tag{4} \]

In the following, we consider the main types of symbolic numeric descriptors.
Once the support $S(i)$, the density function $\psi_i$ and the distribution
function $\Psi_i$ are defined, we show how to treat each of them as a
particular modal-numeric descriptor.

Classic single valued data $S(i) = y_i$ such that $y_i \in \mathbb{R}$, and
$n_i = 1$.

An individual $i \in E$ is described by a single value $y_i$. It is possible
to consider it as a modal-numeric variable whose associated density function
is a Dirac delta function shifted to $y_i$:

\[ \psi_i(y) = \delta(y - y_i) =
   \begin{cases} +\infty & \text{if } y = y_i \\
                 0 & \text{otherwise} \end{cases} \]

subject to the constraint that
$\int_{-\infty}^{+\infty} \delta(y - y_i)\,dy = 1$.
The corresponding distribution function is:

\[ \Psi_i(y = y_i) = \Psi_i(y \le y_i) - \Psi_i(y < y_i)
   = \int_{y_i^-}^{y_i} \delta(y - y_i)\,dy = 1. \]

In this case the modal-numeric description is:

\[ y(i) = (y_i, \Psi_i(y = y_i)) = (y_i, 1). \]

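Interval description $S(i) = [a_i, b_i]$ such that $a_i \le y_i \le b_i$.
Assuming a uniform distribution in $S(i) = [a_i, b_i]$, we can rewrite
$\pi_i$ as

\[ \psi_i(y) =
   \begin{cases} \dfrac{1}{b_i - a_i} & \text{if } a_i \le y \le b_i \\
                 0 & \text{otherwise} \end{cases} \]

The corresponding distribution function is:

\[ \Psi_i(y) =
   \begin{cases} 0 & \text{if } y < a_i \\
                 \int_{a_i}^{y} \dfrac{1}{b_i - a_i}\,du
                   & \text{if } a_i \le y \le b_i \\
                 1 & \text{if } y > b_i \end{cases} \]

In this case the modal-numeric description is:

\[ y(i) = ([a_i, b_i], \Psi_i(a_i \le y \le b_i)) = ([a_i, b_i], 1). \]

If we have information about the distribution of the data in the interval,
we may consider $\Psi_i(y)$ as the (cumulative) distribution function
corresponding to $\psi_i(y)$.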

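Histogram valued description We assume that
$S(i) = [\underline{z}_i, \overline{z}_i]$ (the support is bounded), where
$\underline{z}_i, \overline{z}_i \in \mathbb{R}$. The support is partitioned
into a set of $n_i$ intervals
$S(i) = \{I_{1i}, \ldots, I_{li}, \ldots, I_{n_i i}\}$, where
$I_{li} = [\underline{z}_{li}, \overline{z}_{li})$ and $l = 1, \ldots, n_i$,
i.e.

  i.  $I_{li} \cap I_{mi} = \emptyset$ for $l \neq m$;
  ii. $\bigcup_{l=1,\ldots,n_i} I_{li} = S(i)$.

Histograms assume that the values observed in each interval are uniformly
distributed. It is possible to define the modal description of $i$ as
follows:

\[ y(i) = \Big\{ (I_{li}, \pi_{li}) \;\Big|\; \forall I_{li} \in S(i);\;
   \pi_{li} = \Psi_i(\underline{z}_{li} \le y \le \overline{z}_{li})
            = \int_{I_{li}} \psi_i(z)\,dz \ge 0 \Big\} \]

where $\int_{S(i)} \psi_i(y)\,dy = 1$.

Each element of the support is associated with a weight $\pi_{li}$, such that
$\sum_{l=1}^{n_i} \pi_{li} = 1$.
Given the generic interval $I_{li} = [\underline{z}_{li}, \overline{z}_{li}]$
where $\underline{z}_{li} < \overline{z}_{li}$, and
$U(y \mid I_{li}) = U(y \mid \underline{z}_{li}, \overline{z}_{li})$ as the
continuous Uniform density defined between $\underline{z}_{li}$ and
$\overline{z}_{li}$, we may rewrite a histogram as a linear combination of
Uniform distributions (a mixture) as follows:

\[ \psi_i(y) = \sum_{l=1}^{n_i} \pi_{li}\, U(y \mid I_{li}) \]

where $\psi_i(y)$ is a density function associated to the description of $i$
and the corresponding distribution function is:

\[ \Psi_i(y \le b) = \sum_{l=1}^{n_i} \left[ \pi_{li}
   \int_{-\infty}^{b} U(y \mid I_{li})\,dy \right] \]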



Multi-valued discrete description As in the histogram description, a modal
multi-valued discrete description can be considered as a mixture of Dirac
delta distributions, where $S(i)$ is a set of distinct single values.

The support can also be written as
$S(i) = \{y_{1i}, \ldots, y_{li}, \ldots, y_{n_i i}\}$. Each element of the
support is associated with a weight $\pi_{li}$, such that
$\sum_{l=1}^{n_i} \pi_{li} = 1$. We then consider the function:

\[ \psi_i(y) = \sum_{l=1}^{n_i} \pi_{li}\, \delta(y - y_{li}) \]

where $\psi_i(y)$ is a density function associated to the description of $i$
and the corresponding distribution function is:

\[ \Psi_i(y \le b) = \sum_{l=1}^{n_i} \left( \pi_{li}
   \int_{-\infty}^{b} \delta(y - y_{li})\,dy \right) \]

The same definition can be adapted when we need to describe an individual by
means of a continuous random variable.

Continuous random variable $S(i)$ corresponds to the support of the random
variable and $\psi_i(y)$ corresponds to its density function.

We can then consider the density as

\[ \psi_i(y) = f_i(y \mid \Theta), \]

where $\Theta$ is a vector of parameters, and the distribution function as

\[ \Psi_i(y \le b) = \int_{-\infty}^{b} f_i(y \mid \Theta)\,dy. \]



In order to define new basic statistics (mean, standard deviation), we need
to make some assumptions about the data. It is also important to define a way
of computing distances or inertia measures among data.

Further, we need a way to define an equivalence relation and, if possible,
an order relation among data, and we need to measure the dissimilarity
between two multi-valued descriptions.
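
The interval and histogram descriptions of this section can be sketched
numerically as follows. This is only a minimal illustration in Python with
NumPy, assuming strictly positive bin weights; the helper names
`histogram_cdf` and `histogram_quantile` are hypothetical and do not come
from any library. An interval is treated as a histogram with a single bin.
Single-valued (Dirac) descriptions are not covered by this sketch.

```python
import numpy as np

def histogram_cdf(edges, weights, y):
    """Distribution function of a histogram seen as a mixture of uniforms.

    edges   : increasing bin bounds, length n_i + 1
    weights : bin weights pi_li, length n_i, summing to 1
    """
    edges = np.asarray(edges, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(weights)))  # CDF values at the bin bounds
    # Inside each bin the CDF is linear because the bin density is uniform.
    return np.interp(y, edges, cum, left=0.0, right=1.0)

def histogram_quantile(edges, weights, t):
    """Quantile function (inverse CDF); assumes every weight is > 0."""
    edges = np.asarray(edges, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(weights)))
    return np.interp(t, cum, edges)

# A histogram with bins [0, 1) and [1, 3), each carrying half of the mass.
edges, weights = [0.0, 1.0, 3.0], [0.5, 0.5]
print(histogram_cdf(edges, weights, 2.0))        # CDF at y = 2
print(histogram_quantile(edges, weights, 0.75))  # third quartile

# The interval [2, 6] as a one-bin histogram.
print(histogram_quantile([2.0, 6.0], [1.0], 0.5))
```

The quantile function is the representation used in the rest of the paper,
which is why the sketch exposes it next to the distribution function.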
3 The mean of a set of distributions as a minimizer of the inertia

In order to define the mean of a modal numerical variable, we need to
introduce an operator that satisfies some invariance properties. It is known
that the arithmetic mean is a statistic that has the following properties:

Invariance with respect to the sum i.e. given a set of $n$ elements described
by the variable $y$, the mean $M_y$ satisfies the following equation:

\[ \sum_{i=1}^{n} y_i = n\,M_y \]

The mean is the value that minimizes the inertia The (moment of) inertia of
a set of distributions is defined as the sum of the squared distances between
all the elements of the set and its barycenter. Given a set of $n$ elements
described by the variable $y$, the mean (barycenter) $M$ is the argmin of the
following minimization problem:

\[ \min_{M} \sum_{i=1}^{n} (y_i - M)^2 = \min_{M} \sum_{i=1}^{n} d^2(y_i, M) \]

Two main issues arise from such conditions: the definition of the sum of
distributions, and the definition of a consistent distance between
distributions. The first problem is strictly related to the second. Indeed,
we may define the sum (or linear combination) of distributions once a
distance function between two distributions has been defined. In [?], a
review is given of the distances that can be used for comparing distributions
and for defining the mean (barycenter) element which minimizes the inertia.
First of all, when we treat data represented as random variables, we observe
that it is preferable to work with their distribution functions. We observe
that in two cases it is possible to define a barycenter element that can be
represented as a distribution: using the $L_2$ norm and using the
Wasserstein-Kantorovich-Monge-Gini-Mallows $L_2$ distance. In the first case,
the barycenter random variable of a set of data described as random variables
is their mixture. In the second case, the barycenter of a set of data
described as random variables can be represented by a random variable whose
quantiles are the means of the corresponding quantiles of the data: i.e.,
the quantile function of the barycenter random variable is the mean of the
quantile functions associated with the data distributions. In the next
paragraph we present the Wasserstein-Kantorovich-Monge-Gini-Mallows $L_2$
distance (we call it simply the Wasserstein distance) and its properties.

3.1 Wasserstein distance between distributions

If $\Phi_i(y)$ and $\Phi_j(y)$ are the distribution functions of two random
variables with density functions $\phi_i(y)$ and $\phi_j(y)$ respectively,
with first moments $\mu_i$ and $\mu_j$, and $s_i$ and $s_j$ their standard
deviations, the Wasserstein $L_2$ metric is defined as [14]

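\[ d_W(\phi_i, \phi_j) = \left[ \int_0^1
   \left( \Phi_i^{-1}(t) - \Phi_j^{-1}(t) \right)^2 dt \right]^{1/2} \tag{5} \]

where $\Phi_i^{-1}(t)$ and $\Phi_j^{-1}(t)$ are the quantile functions of the
two distributions. It is possible to prove (see [?]) that the squared
distance can be decomposed as:

\[ d_W^2(\phi_i, \phi_j) =
   \underbrace{(\mu_i - \mu_j)^2}_{\text{Location}}
 + \underbrace{(s_i - s_j)^2}_{\text{Size}}
 + \underbrace{2 s_i s_j \left( 1 - \rho_{QQ}(\Phi_i^{-1}, \Phi_j^{-1})
   \right)}_{\text{Shape}} \tag{6} \]

where

\[ \rho_{QQ}(\Phi_i^{-1}, \Phi_j^{-1})
   = \frac{\int_0^1 \left( \Phi_i^{-1}(t) - \mu_i \right)
                    \left( \Phi_j^{-1}(t) - \mu_j \right) dt}{s_i s_j}
   = \frac{\int_0^1 \Phi_i^{-1}(t)\, \Phi_j^{-1}(t)\, dt - \mu_i \mu_j}
          {s_i s_j} \tag{7} \]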

is the correlation coefficient of the quantiles of the two distributions as
represented in a classical QQ plot. It is worth noting that
$0 < \rho_{QQ} \le 1$, differently from the classical range of variation of
the Bravais-Pearson correlation coefficient $\rho$. This decomposition allows
us to take into consideration three aspects in the comparison of distribution
functions. The first aspect is related to the location: two distributions can
differ in position, and this aspect is explained by the distance between the
mean values of the two distributions. The second aspect is related to the
different variability of the compared distributions: the different standard
deviations and the different shapes of the density functions. While the
former sub-aspect is taken into account by the distance between the standard
deviations, the latter sub-aspect is taken into consideration by the value of
$\rho_{QQ}$. Indeed, $\rho_{QQ}$ is equal to one only if the two
(standardized) distributions have the same shape.
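
The decomposition into location, size and shape can be checked numerically.
The following is a minimal sketch, assuming that each distribution is
represented by its quantile function sampled at the midpoints of a regular
grid on $[0, 1]$, so that a mean over the grid approximates an integral over
$t$. The function names and the two example distributions are illustrative
choices, not part of the original method description.

```python
import numpy as np

def quantile_grid(m=10_000):
    """Midpoints of m equal cells of [0, 1]."""
    return (np.arange(m) + 0.5) / m

def wasserstein_sq(q_i, q_j):
    """Squared L2 Wasserstein distance from two sampled quantile functions."""
    return float(np.mean((q_i - q_j) ** 2))

def decomposition(q_i, q_j):
    """Location, size and shape terms of the squared distance."""
    mu_i, mu_j = q_i.mean(), q_j.mean()
    s_i, s_j = q_i.std(), q_j.std()
    # Correlation between the two quantile functions (QQ-correlation).
    rho_qq = np.mean((q_i - mu_i) * (q_j - mu_j)) / (s_i * s_j)
    location = (mu_i - mu_j) ** 2
    size = (s_i - s_j) ** 2
    shape = 2.0 * s_i * s_j * (1.0 - rho_qq)
    return float(location), float(size), float(shape)

t = quantile_grid()
q_uniform = 0.0 + 2.0 * t          # quantile function of U(0, 2)
q_exponential = -np.log(1.0 - t)   # quantile function of Exp(1)

d2 = wasserstein_sq(q_uniform, q_exponential)
location, size, shape = decomposition(q_uniform, q_exponential)
print(d2, location + size + shape)  # the two values coincide up to rounding
```

For two uniform distributions the shape term vanishes, because their
standardized quantile functions coincide and $\rho_{QQ} = 1$.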

Considering that a quantile function is a non-decreasing function
$f : [0, 1] \to \mathbb{R}$ taking values in the support of $y(i)$, an
interesting result is the possibility of defining the inner product of two
quantile functions from equation (7).

Definition 4. Given two quantile functions $\Phi_i^{-1}$ and $\Phi_j^{-1}$,
associated with two pdf's $\phi_i(y)$ and $\phi_j(y)$ with means $\mu_i$ and
$\mu_j$ and standard deviations $s_i$ and $s_j$, their inner product is
defined as follows:

\[ \langle \Phi_i^{-1}, \Phi_j^{-1} \rangle
   = \int_0^1 \Phi_i^{-1}(t)\, \Phi_j^{-1}(t)\, dt
   = \rho_{QQ}(\Phi_i^{-1}, \Phi_j^{-1})\, s_i s_j + \mu_i \mu_j. \tag{8} \]

The proof is straightforward with a little algebra from equation (7).
Using this distance, we introduce an extended concept of inertia for a set of
distributions.

The squared Wasserstein distance allows us to introduce the sum of a set
of quantile functions. It is known that the sum of non-decreasing functions
is itself a non-decreasing function^1, i.e., given $n$ quantile functions,
the function $S^{-1}(t)$ defined as follows is itself a quantile function:

\[ S^{-1}(t) = \sum_{i=1}^{n} \Phi_i^{-1}(t) \qquad \forall t \in [0, 1]. \tag{9} \]

Defining the product between a scalar $k \in \mathbb{R}^+$ and a quantile
function $\Phi_i^{-1}(t)$ as:

\[ k\, \Phi_i^{-1}(t) \qquad \forall t \in [0, 1] \tag{10} \]

we can define the mean quantile function (or barycenter)
$\bar{\Phi}^{-1}(t)$ as:

\[ \bar{\Phi}^{-1}(t) = \frac{1}{n}\, S^{-1}(t) \qquad \forall t \in [0, 1]. \tag{11} \]

^1 In general, the difference between two non-decreasing functions is not a
non-decreasing function. Hence, we are not able to define a difference
operator between quantile functions.



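With $\bar{\Phi}^{-1}(t)$ we can associate the distribution function of the
barycenter, which we denote by $\bar{\Phi}(y)$, and its density function
$\bar{\phi}(y)$. Since the barycenter is a distribution, we can also compute
its mean as

\[ \mu_y = \int_{-\infty}^{+\infty} y\, \bar{\phi}(y)\, dy \tag{12} \]

and its standard deviation as

\[ \sigma_y = \left[ \int_{-\infty}^{+\infty} (y - \mu_y)^2\,
   \bar{\phi}(y)\, dy \right]^{1/2} \tag{13} \]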



The last result is very interesting. Indeed, it states that the barycenter of
a set of data described by distributions is itself a distribution. If we have
single-valued data (points), the barycenter is a point (i.e. it generalizes
the arithmetic mean of a set of standard data); if we have interval-valued
data, the barycenter is an interval-valued description; if we have
histogram-valued data, the barycenter is a histogram.
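
As an illustration of the interval case, the sketch below averages the
quantile functions of three intervals (each assumed uniform) and compares the
result with the quantile function of the interval whose bounds are the means
of the bounds. It is a toy example under these assumptions; the array layout
and the name `barycenter_quantile` are hypothetical.

```python
import numpy as np

def barycenter_quantile(Q):
    """Mean quantile function; Q has shape (n, m), one sampled quantile function per row."""
    return Q.mean(axis=0)

m = 1_000
t = (np.arange(m) + 0.5) / m                      # midpoint grid on [0, 1]
intervals = np.array([[0.0, 2.0], [1.0, 5.0], [2.0, 8.0]])

# Quantile function of U(a, b) is a + t * (b - a); broadcasting gives shape (3, m).
Q = intervals[:, [0]] + t * (intervals[:, [1]] - intervals[:, [0]])
q_bar = barycenter_quantile(Q)

# Interval whose bounds are the means of the lower and upper bounds.
a_bar, b_bar = intervals.mean(axis=0)
expected = a_bar + t * (b_bar - a_bar)

print(np.allclose(q_bar, expected))   # barycenter is again a uniform interval
print(q_bar.mean())                   # mean of the barycenter distribution
```

The same averaging applies to histograms, provided all quantile functions are
sampled on a common grid of $t$ values.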



4 The inertia of a set of data described by modal 
numeric variables 

A representative (prototype, barycenter) $y_E$ associated with a set $E$ of
$n$ elements described by a random variable $y$ defined on
$D \subseteq \mathbb{R}$ is an element of the space of description of $E$.
Extending the inertia concept of a set of points to a set of distributions,
we define such inertia as:

\[ \mathrm{Inertia}_E = \sum_{i=1}^{n} d_W^2(y(i), y_E)
   = \sum_{i=1}^{n} \int_0^1 \left( \Phi_i^{-1}(t) - \bar{\Phi}^{-1}(t)
     \right)^2 dt \tag{14} \]

The barycenter $y_E$ is obtained by minimizing the inertia criterion in (14),
in the same way as the mean is the best least squares fit of a constant
function to the given data points.

The barycenter distribution $F_E$ is a distribution whose $t$-th quantile is
the mean of the $t$-th quantiles of the $n$ distributions belonging to $E$.
We introduce new measures of variability which are consistent with the
classical concept of variability of a set of elements, without discarding any
characteristics of the complex data (bounds, internal variability, shape,
etc.).

It is interesting to note that the Wasserstein distance allows the Huygens
theorem of decomposition of inertia for clustered data. Indeed, we showed
[?, ?] that it can be considered as an extension of the Euclidean distance
between quantile functions.

Reasoning by analogy with classic single-valued numerical data, the inertia
of a set of $n$ points described by a single-valued real variable
$y_i \in \mathbb{R}$ ($i = 1, \ldots, n$) is given by the sum of the squared
Euclidean distances between each pair of observations:

\[ \mathrm{Inertia}(y) = \sum_{i=1}^{n} \sum_{j=1}^{n} d^2(y_i, y_j)
   = \sum_{i=1}^{n} \sum_{j=1}^{n} (y_i - y_j)^2 \]

It can be proved that:

\[ \mathrm{Inertia}(y) = 2n \sum_{i=1}^{n} (y_i - \bar{y})^2
   = 2n \cdot ss_y = 2n^2 \cdot s_y^2 \tag{15} \]

where $ss_y$ is the sum of the squared differences of the points from the
mean and $s_y^2$ is the variance. In our case, we generalize such statistics
to the case of modal numerical variables as follows:

\[ \mathrm{Inertia}(y) = \sum_{i=1}^{n} \sum_{j=1}^{n} d_W^2(y(i), y(j))
   = \sum_{i=1}^{n} \sum_{j=1}^{n} \int_0^1
     \left( \Phi_i^{-1}(t) - \Phi_j^{-1}(t) \right)^2 dt. \tag{16} \]

Also in this case, it is possible to prove that:

\[ \mathrm{Inertia}(y) = 2n \sum_{i=1}^{n} \int_0^1
   \left( \Phi_i^{-1}(t) - \bar{\Phi}^{-1}(t) \right)^2 dt. \]

We define $ss_y$ as the sum of the squared differences from the barycenter
$\bar{\Phi}^{-1}$ and $s_y^2$ as the variance of a modal numerical variable,
and we can write:

\[ \mathrm{Inertia}(y) = 2n \cdot ss_y = 2n^2 \cdot s_y^2 \tag{17} \]



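The definition of the variance and of the standard deviation allows us to
define a measure of variability of a set of modal multi-valued numerical
data. We define the standard deviation as:

\[ s_y = \left[ \frac{1}{n} \sum_{i=1}^{n} d_W^2(y(i), y_E) \right]^{1/2}
   = \left[ \frac{1}{n} \sum_{i=1}^{n} \int_0^1
     \left( \Phi_i^{-1}(t) - \bar{\Phi}^{-1}(t) \right)^2 dt
     \right]^{1/2} \tag{18} \]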



The main properties of this measure are the same as those of a classical
variability measure:

Non-negativity $s_y \ge 0$.

Constant data description If all data have the same modal multi-valued
numerical description, $s_y = 0$.

Shrinking Given two real numbers $h$ and $k$:

\[ s_{(h \cdot y + k)} = |h|\, s_y \]

Comparing this measure with those introduced by Bertrand and Goupil [5] and
extended by Billard and Diday [?], the main differences are related to the
value of the standard deviation when the data have the same description. In
that case, the standard deviation proposed by Billard and Diday [?] is
generally greater than zero even when the data have the same modal numerical
description.

5 Measures of interdependence between modal numerical variables

In this section, we introduce new statistics for measuring the
interdependence between two modal multi-valued numerical variables. We start
by introducing a new measure of the covariance between two variables denoted
by $y$ and $z$, extending the covariance measure to modal multi-valued
numerical data.

For each individual we know only the marginal distributions (the modal
multi-valued numerical description for each variable) of the multivariate
distribution that has generated it, and it is not possible to know the
dependency structure between two modal multi-valued numerical descriptions
observed for two variables. We assume that each individual is described by
independent descriptions for each variable. This assumption is commonly used
in the analysis of symbolic data [?]. Under this assumption, given two modal
numerical variables $y$ and $z$, a set $E$ of $n$ modal numerical data with
distributions $F_{iy}$ and $F_{iz}$ ($i = 1, \ldots, n$), and considering the
barycenter distributions $\bar{F}_y$ and $\bar{F}_z$ of $E$ for the two
variables, we propose to extend the classical codeviance ($ss_{yz}$) measure
to modal numerical variables as:

\[ ss_{yz} = \sum_{i=1}^{n} \int_0^1
   \left( F_{iy}^{-1}(t) - \bar{F}_y^{-1}(t) \right)
   \left( F_{iz}^{-1}(t) - \bar{F}_z^{-1}(t) \right) dt \tag{20} \]



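Recalling equation (7), we may express it as:

\[ ss_{yz} = \sum_{i=1}^{n} \left[ \alpha_i\, s_{iy} s_{iz}
   - \beta_i\, \sigma_y s_{iz} - \gamma_i\, s_{iy} \sigma_z
   + \delta\, \sigma_y \sigma_z
   + (\mu_{iy} - \mu_y)(\mu_{iz} - \mu_z) \right] \tag{21} \]

where $\mu_{iy}$, $s_{iy}$ ($\mu_{iz}$, $s_{iz}$) are the mean and the
standard deviation of the distribution observed for the $i$-th individual on
the $y$ (resp. $z$) variable, $\mu_y$, $\sigma_y$ ($\mu_z$, $\sigma_z$) are
those of the corresponding barycenter distribution, and

• $\alpha_i = \rho_{QQ}(F_{iy}^{-1}, F_{iz}^{-1})$ is the QQ-correlation
  between the quantile function for the $y$ variable and the quantile
  function for the $z$ variable observed for the $i$-th individual,

• $\beta_i = \rho_{QQ}(\bar{F}_y^{-1}, F_{iz}^{-1})$ is the QQ-correlation
  between the quantile function of the barycenter of the $y$ variable and the
  quantile function for the $z$ variable observed for the $i$-th individual,

• $\gamma_i = \rho_{QQ}(F_{iy}^{-1}, \bar{F}_z^{-1})$ is the QQ-correlation
  between the quantile function of the barycenter of the $z$ variable and the
  quantile function for the $y$ variable observed for the $i$-th individual,

• $\delta = \rho_{QQ}(\bar{F}_y^{-1}, \bar{F}_z^{-1})$ is the QQ-correlation
  between the quantile function of the barycenter of the $y$ variable and the
  quantile function of the barycenter of the $z$ variable.

As a particular case, if all the distributions have the same shape (for
example, they all follow Gaussian distributions), then the $\rho_{QQ}$'s are
equal to 1 and $ss_{yz}$ can be simplified as

\[ ss_{yz} = \left( \sum_{i=1}^{n} \mu_{iy} \mu_{iz} - n\, \mu_y \mu_z \right)
   + \left( \sum_{i=1}^{n} s_{iy} s_{iz} - n\, \sigma_y \sigma_z \right) \]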




It is interesting to note that this approach is fully consistent with the
classical decomposition of the codeviance. Indeed, we may consider a
distribution as information related to a group of individuals. It can be
proven that, given a set of individuals grouped into $k$ classes, the total
codeviance can be decomposed into two additive components, the codeviance
within and the codeviance between groups. With minimal algebra it is possible
to prove that $|ss_{yz}|$ cannot be greater than $\sqrt{ss_y \cdot ss_z}$.
Then, we introduce the correlation measure for two modal multi-valued
numerical variables as:

\[ r_{yz} = \frac{ss_{yz}}{\sqrt{ss_y \cdot ss_z}}
          = \frac{s_{yz}}{s_y\, s_z} \]

where $s_{yz} = ss_{yz} / n$ is the covariance. It is worth noting that if
all modal multi-valued numerical descriptions are identically distributed
except for the first moments, $r_{yz}$ depends only on the correlation of
their first two moments, and that if $r_{yz} = 1$ (resp. $-1$) then all the
histograms have their first moments aligned along a positive (resp. negative)
sloped line and are identically distributed (except for the first two
moments).
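
A small numerical sketch of the codeviance (20) and of the correlation
measure is given below. It assumes interval-valued data with uniform
densities, so that all descriptions share the same shape and the simplified
expression of $ss_{yz}$ applies. Function names and the toy data are
hypothetical.

```python
import numpy as np

def codeviance(Qy, Qz):
    """Sum over individuals of the integrated product of deviations from the barycenters."""
    dy = Qy - Qy.mean(axis=0)
    dz = Qz - Qz.mean(axis=0)
    return float(np.sum(np.mean(dy * dz, axis=1)))

def correlation(Qy, Qz):
    """Correlation between two modal numerical variables."""
    return codeviance(Qy, Qz) / np.sqrt(codeviance(Qy, Qy) * codeviance(Qz, Qz))

def interval_quantiles(lows, highs, t):
    """Sampled quantile functions of uniform intervals, shape (n, m)."""
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    return lows[:, None] + t * (highs - lows)[:, None]

m = 1_000
t = (np.arange(m) + 0.5) / m
Qy = interval_quantiles([0.0, 1.0, 2.0, 4.0], [2.0, 5.0, 8.0, 5.0], t)
Qz = interval_quantiles([3.0, 0.0, 6.0, 7.0], [4.0, 3.0, 7.0, 12.0], t)

ss_yz = codeviance(Qy, Qz)
r_yz = correlation(Qy, Qz)

# Same-shape simplification: codeviance of the means plus codeviance of the
# standard deviations (the barycenter mean and std are the averages here).
mu_y, mu_z = Qy.mean(axis=1), Qz.mean(axis=1)
s_y, s_z = Qy.std(axis=1), Qz.std(axis=1)
n = len(mu_y)
simplified = (np.sum(mu_y * mu_z) - n * mu_y.mean() * mu_z.mean()) \
           + (np.sum(s_y * s_z) - n * s_y.mean() * s_z.mean())

print(ss_yz, simplified, r_yz)
```

In this toy example the direct computation and the simplified expression
agree, and the correlation lies in $[-1, 1]$ as stated in the text.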

6 Using basic statistics for modal numeric data:
Mahalanobis-Wasserstein distance

The proposed statistics can be useful for extending several algorithms of data 
analysis from classical single-valued numerical data to modal multi-valued nu- 
merical data. 

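Using the standard deviation and the covariance measure we can, for example,
introduce the Mahalanobis version of the Wasserstein distance as follows.
Given a set $E$ of $n$ individuals described by $p$ modal numerical
variables, each individual can be described as a vector
$\mathbf{y}(i) = [y_1(i), \ldots, y_p(i)]$. Let the variance-covariance
matrix be denoted by $\Sigma = [s_{hk}]_{p \times p}$ and its inverse by
$\Sigma^{-1} = [a_{hk}]_{p \times p}$; we can introduce the
Mahalanobis-Wasserstein distance as follows:

\[ d_{MW}(\mathbf{y}(i), \mathbf{y}(i')) =
   \sqrt{ \sum_{h=1}^{p} \sum_{k=1}^{p} a_{hk} \int_0^1
   \left( F_{ih}^{-1}(t) - F_{i'h}^{-1}(t) \right)
   \left( F_{ik}^{-1}(t) - F_{i'k}^{-1}(t) \right) dt } \tag{23} \]

The squared distance can be written:

\[ \begin{aligned}
d_{MW}^2(\mathbf{y}(i), \mathbf{y}(i'))
 &= \sum_{k=1}^{p} a_{kk}\, d_W^2(F_{ik}, F_{i'k})
  + 2 \sum_{h=1}^{p-1} \sum_{k=h+1}^{p} a_{hk} \int_0^1
    \left( F_{ih}^{-1}(t) - F_{i'h}^{-1}(t) \right)
    \left( F_{ik}^{-1}(t) - F_{i'k}^{-1}(t) \right) dt \\
 &= \sum_{k=1}^{p} a_{kk}\, d_W^2(F_{ik}, F_{i'k})
  + 2 \sum_{h=1}^{p-1} \sum_{k=h+1}^{p} a_{hk}
    \left[ (\alpha_{hk} - \beta_{hk} - \gamma_{hk} + \delta_{hk})
    + (\mu_{ih} - \mu_{i'h})(\mu_{ik} - \mu_{i'k}) \right]
\end{aligned} \tag{24} \]

where:

\[ \alpha_{hk} = \rho_{QQ}(F_{ih}^{-1}, F_{ik}^{-1}) \cdot s_{ih} s_{ik};
   \qquad
   \beta_{hk} = \rho_{QQ}(F_{ih}^{-1}, F_{i'k}^{-1}) \cdot s_{ih} s_{i'k}; \]

\[ \gamma_{hk} = \rho_{QQ}(F_{ik}^{-1}, F_{i'h}^{-1}) \cdot s_{ik} s_{i'h};
   \qquad
   \delta_{hk} = \rho_{QQ}(F_{i'h}^{-1}, F_{i'k}^{-1}) \cdot s_{i'h} s_{i'k}. \]

If all distributions have the same shape (i.e., the distributions differ only
in their first two moments) the distance can be simplified as:

\[ d_{MW}^2(\mathbf{y}(i), \mathbf{y}(i'))
   = \sum_{k=1}^{p} d_W^2(y_k(i), y_k(i'))\, a_{kk}
   + 2 \sum_{h=1}^{p-1} \sum_{k=h+1}^{p}
     \left[ (s_{ih} - s_{i'h})(s_{ik} - s_{i'k})
     + (\mu_{ih} - \mu_{i'h})(\mu_{ik} - \mu_{i'k}) \right] a_{hk}. \tag{25} \]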
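
The following sketch shows one possible way to evaluate the squared
Mahalanobis-Wasserstein distance of equation (23) numerically. It assumes
randomly generated interval-valued toy data, quantile functions sampled on a
common midpoint grid, and a covariance matrix built from the codeviances of
Section 5 divided by $n$; all names are illustrative and the data are not
those of the paper.

```python
import numpy as np

def covariance_matrix(Q):
    """Wasserstein covariance matrix; Q has shape (n, p, m)."""
    n, _, m = Q.shape
    D = Q - Q.mean(axis=0)                       # deviations from the p barycenters
    # Sum over individuals and grid points, then average: entry (h, k) is ss_hk / n.
    return np.einsum('ihm,ikm->hk', D, D) / (n * m)

def mahalanobis_wasserstein_sq(q_i, q_j, A):
    """Squared distance; q_i and q_j have shape (p, m), A is the inverse covariance matrix."""
    diff = q_i - q_j
    G = diff @ diff.T / diff.shape[1]            # G[h, k] approximates the integral in (23)
    return float(np.sum(A * G))

rng = np.random.default_rng(0)
n, p, m = 30, 3, 500
t = (np.arange(m) + 0.5) / m
lows = rng.normal(size=(n, p))
widths = rng.uniform(0.5, 2.0, size=(n, p))
Q = lows[:, :, None] + widths[:, :, None] * t    # uniform intervals, shape (n, p, m)

A = np.linalg.inv(covariance_matrix(Q))
d2 = mahalanobis_wasserstein_sq(Q[0], Q[1], A)
print(d2)
```

With the identity matrix in place of `A`, the function returns the sum over
the variables of the squared Wasserstein distances, i.e. the non-Mahalanobis
version of the distance.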



7 An application on a climatic dataset 

In this section, we show some clustering results for data describing the mean
monthly temperature, pressure, relative humidity, wind speed and total
monthly precipitation of 60 meteorological stations of the People's Republic
of China^2, recorded from 1840 to 1988. For the aims of this paper, we have
considered the distributions of the variables for January (the coldest month)
and July (the hottest month), so our initial data is a $60 \times 10$ matrix
where the generic $(i, j)$ cell contains the distribution of the values of
the $j$-th variable for the $i$-th meteorological station. Figure 1 shows the
geographic position of the 60 stations, Table 1 reports the basic statistics
proposed in Section 4, and Table 2 shows the interdependency measures
proposed in Section 5. In particular, the upper triangle of the matrix
contains the covariances, while the lower triangle contains the correlations
for each pair of histogram variables.

^2 Dataset URL: http://dss.ucar.edu/datasets/ds578.5/

Figure 1: The 60 meteorological stations of the China dataset; beside each
point there is the elevation in meters.





Variable  Description                               $\mu_j$  $\sigma_j$   $s_j^2$   $s_j$
$y_1$     Mean Relative Humidity (percent) Jan         67.9         7.0     127.9    11.3
$y_2$     Mean Relative Humidity (percent) July        73.9         4.5     114.2    10.7
$y_3$     Mean Station Pressure (mb) Jan              968.3         3.6    5864.7    76.5
$y_4$     Mean Station Pressure (mb) July             951.1         3.0    5084.4    71.3
$y_5$     Mean Temperature (Cel.) Jan                  -1.2         1.7     114.8    10.7
$y_6$     Mean Temperature (Cel.) July                 25.2         1.0      11.3     3.4
$y_7$     Mean Wind Speed (m/s) Jan                     2.3         0.6       1.1     1.0
$y_8$     Mean Wind Speed (m/s) July                    2.3         0.5       0.6     0.8
$y_9$     Total Precipitation (mm) Jan                 18.2        14.3     519.6    22.7
$y_{10}$  Total Precipitation (mm) July               144.6        80.8     499.9    70.7

Table 1: Basic statistics of the histogram variables: $\mu_j$ and $\sigma_j$
are the mean and the standard deviation of the barycenter distribution of the
$j$-th variable, while $s_j^2$ and $s_j$ are the variance and the standard
deviation as presented in this paper.



Vars       $y_1$   $y_2$    $y_3$    $y_4$   $y_5$   $y_6$  $y_7$  $y_8$   $y_9$  $y_{10}$
$y_1$      128.0    49.1    510.2    486.1    34.0    20.0    0.7    1.6   109.9      97.9
$y_2$       0.41   114.2    392.6    376.4    53.5    11.2    4.2    1.2    72.3     475.6
$y_3$       0.59    0.48  5,864.7  5,455.2   162.9   198.3   32.0   24.3   672.4   1,570.2
$y_4$       0.60    0.49     0.98  5,084.4   158.9   183.1   29.9   22.5   634.8   1,504.6
$y_5$       0.28    0.47     0.20     0.21   114.8    22.5    0.0   -1.5   119.7     305.6
$y_6$       0.52    0.31     0.77     0.76    0.62    11.3    0.4    0.3    41.4      56.9
$y_7$       0.06    0.38     0.40     0.40    0.00    0.13    1.1    0.7     2.9      17.0
$y_8$       0.17    0.14     0.39     0.39   -0.18    0.11   0.82    0.6     1.6      -0.7
$y_9$       0.43    0.30     0.39     0.39    0.49    0.54   0.12   0.09   519.6     426.0
$y_{10}$    0.12    0.63     0.29     0.30    0.40    0.24   0.23  -0.01    0.26   4,999.3

Table 2: Covariances (upper triangle, variances on the diagonal) and
correlations (lower triangle) of the ten histogram variables
7.1 Dynamic clustering

The Dynamic Clustering Algorithm (DCA) represents a general reference
for partitioning algorithms; in this paper we use it in a k-means-like
version. Let $E$ be a set of $n$ data described by $p$ histogram variables
$y_j$ ($j = 1, \ldots, p$). The general DCA looks for the partition
$P \in P_k$ of $E$ into $k$ classes, among all the possible partitions $P_k$,
and the vector $L \in L_k$ of $k$ prototypes representing the classes in $P$,
such that the following fitting criterion $\Delta$ between $L$ and $P$ is
minimized:

\[ \Delta(P^*, L^*) = \min \{ \Delta(P, L) \mid P \in P_k,\ L \in L_k \}. \tag{26} \]

Such a criterion is defined as the sum of the dissimilarity or distance
measures $\delta(x_i, G_h)$ of fitting between each object $x_i$ belonging to
a class $C_h \in P$ and the class representation $G_h \in L$:

\[ \Delta(P, L) = \sum_{h=1}^{k} \sum_{x_i \in C_h} \delta(x_i, G_h). \]
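
The alternation between allocation and representation steps can be sketched
as follows. This is a simplified k-means-like illustration, assuming that
$\delta$ is the squared Wasserstein distance summed over the variables, that
each description is a quantile function sampled on a common grid, and that
prototypes are initialized by picking $k$ random individuals. The function
name, the toy data and the stopping rule are assumptions of this sketch, not
the implementation used for the results reported in this paper.

```python
import numpy as np

def dynamic_clustering(Q, k, n_iter=50, seed=0):
    """k-means-like DCA on sampled quantile functions.

    Q has shape (n, p, m). Returns labels (n,), prototypes (k, p, m) and the
    value of the fitting criterion.
    """
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    prototypes = Q[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Allocation step: squared Wasserstein distance to each prototype.
        d2 = ((Q[:, None] - prototypes[None]) ** 2).mean(axis=3).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # partition is stable
        labels = new_labels
        # Representation step: barycenter (mean quantile function) of each class.
        for h in range(k):
            members = labels == h
            if members.any():
                prototypes[h] = Q[members].mean(axis=0)
    d2 = ((Q[:, None] - prototypes[None]) ** 2).mean(axis=3).sum(axis=2)
    criterion = float(d2[np.arange(n), labels].sum())
    return labels, prototypes, criterion

# Toy data: one interval-valued variable, two well separated groups of 10 units.
rng = np.random.default_rng(1)
m = 200
t = (np.arange(m) + 0.5) / m
lows = np.concatenate([rng.uniform(0, 1, 10), rng.uniform(100, 101, 10)])
widths = rng.uniform(1, 2, 20)
Q = (lows[:, None] + widths[:, None] * t)[:, None, :]   # shape (20, 1, 200)

labels, prototypes, criterion = dynamic_clustering(Q, k=2)
print(labels, criterion)
```

Replacing the allocation distance with a Mahalanobis-Wasserstein distance
would only change the computation of `d2`; the representation step stays the
same in this sketch.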



A prototype $G_h$ associated with a class $C_h$ is an element of the space of
description of $E$, and it can be represented as a vector of histograms. The
algorithm is initialized by generating $k$ random clusters or, alternatively,
$k$ random prototypes. We here present the results of two dynamic clusterings
using $k = 5$. The former takes $\delta$ as the squared Wasserstein distance
among standardized data, while the latter uses the squared
Mahalanobis-Wasserstein distance [3]. We have performed 100 initializations
and we have considered the two partitions giving the best quality index as
defined in Chavent et al. [10]:

\[ Q(P_k) = 1 - \frac{\sum_{h=1}^{k} \sum_{x_i \in C_h} \delta(x_i, G_h)}
                     {\sum_{i=1}^{n} \delta(x_i, G_E)} \]

where G^; is the prototype of the set E. Q{Pk) can be considered as the general- 
ization of the ratio between the inter-cluster inertia and the total inertia of the 
dataset. Comparing the two clustering results, we may observe that the two 





CI 1 


Clusters using 
Mahal. -Wass. distance 
CI 2 CI 3 CI 4 


CI 5 


Total 




CI 1 


2 


1 


7 




2 


12 


Clustering using 


CI 2 


4 


7 








11 


Wasserstein distance 


CI 3 






19 




2 


21 


between standardized 


CI 4 




3 




2 




5 


data 


CI 5 


2 




4 




5 


11 


Total 


8 


11 


30 


2 


9 


60 



Table 3: Cross-classification table of the clusters obtained from the two dynamic 
clusterings. 



clusterings agree on only 65% of the observations (see Table 3): while DCA
using the Wasserstein distance on standardized data attains a quality index
(share of inter-cluster inertia) of 61.53%, DCA using the Mahalanobis-Wasserstein
distance attains 91.64%. Moreover, considering that the Mahalanobis distance
removes redundancy between the variables, it allows the definition of five
clusters that collect stations at different elevations: cluster 3 contains the
stations between 0 and 140 meters, cluster 5 those between 140 and 400 meters,
cluster 1 those between 500 and 900 meters, cluster 2 those between 1,000 and
1,800 meters, and cluster 4 those between 2,000 and 3,500 meters. Observing a
physical map of China, the obtained clusters seem more representative of the
different typologies of meteorological stations in terms of location and
elevation. It is interesting to note that, also in this case, using a
Mahalanobis metric for clustering offers the same advantages as clustering after
a factor analysis (for example, a Principal Components Analysis), because it
removes redundant information (in terms of linear relationships) among the
descriptors.

Figure 2: Dynamic clustering of the China dataset into 5 clusters (cluster
cardinalities in brackets) using the Wasserstein distance on standardized data,
$Q(P_5) = 0.6253$.

Figure 3: Dynamic clustering of the China dataset into 5 clusters (cluster
cardinalities in brackets) using the Mahalanobis-Wasserstein distance,
$Q(P_5) = 0.9164$.
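The quality index $Q(P_k)$ reported for the two partitions can be computed along
the following lines. This is only a sketch under assumptions that are not stated
in the text: a single distributional variable, quantile functions sampled on a
common uniform grid, the squared Wasserstein distance approximated by a mean
squared difference, and prototypes taken as average quantile functions. The
function name `quality_index` is hypothetical.

```python
import numpy as np

def quality_index(Q, labels):
    """Q(P_k) = 1 - (within-cluster sum of distances) / (total sum of distances).

    Q has shape (n_objects, n_grid_points); labels holds one class id per object.
    """
    g_all = Q.mean(axis=0)                      # prototype G_E of the whole set E
    total = ((Q - g_all) ** 2).mean(axis=1).sum()
    within = 0.0
    for h in np.unique(labels):
        members = Q[labels == h]
        g_h = members.mean(axis=0)              # prototype G_h of class C_h
        within += ((members - g_h) ** 2).mean(axis=1).sum()
    return 1.0 - within / total

# Usage: six intervals treated as uniform densities, split into two classes.
t = (np.arange(50) + 0.5) / 50
bounds = [(0, 1), (0.5, 1.5), (0.2, 1.2), (10, 12), (10.5, 12.5), (11, 13)]
Q = np.array([a + (b - a) * t for a, b in bounds])
print(quality_index(Q, np.array([0, 0, 0, 1, 1, 1])))
```

A value close to 1 indicates that most of the total inertia lies between the
clusters, while a partition made of a single class yields 0.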

8 Conclusions and future research 

In this paper we have presented new basic statistics for numerical modal-valued
variables, developed using the Wasserstein metric. The proposed statistics can
also be used in interval data analysis when the intervals are considered as
uniform densities, following Bertrand and Goupil and Billard. Using the
Wasserstein distance, we showed a way to compute a standard deviation for
standardizing data, extending the classical concept of inertia to a set of
numerical modal-valued data. The proposed dependence measures between variables
can be considered as useful new tools for developing further analysis techniques
for this kind of data. The next step, considered very hard from a computational
point of view (see Cuesta-Albertos et al. [11]), is to find a way of accounting
for the dependencies within the observations of multivariate numerical
modal-valued data when computing the Wasserstein distance. Furthermore, a deeper
study of inference based on this kind of data could give a strong impulse to the
research.



A Proof of the decomposition of the Wasserstein distance



Let us consider two density functions $f_i(y)$ and $f_j(y)$ with finite first
two moments. With each density function we can associate the distribution
functions $F_i(y)$ and $F_j(y)$, the means $\mu_i$ and $\mu_j$, and the standard
deviations $s_i$ and $s_j$. We want to prove that:

$$d_W^2 := \int_0^1 \left(F_i^{-1}(t) - F_j^{-1}(t)\right)^2 dt =
\underbrace{(\mu_i - \mu_j)^2}_{\text{Location}} +
\underbrace{(s_i - s_j)^2}_{\text{Size}} +
\underbrace{2 s_i s_j \left(1 - Corr_{QQ}(Y(i), Y(j))\right)}_{\text{Shape}} \tag{27}$$

Indeed,

$$\mu = \int_{-\infty}^{+\infty} y f(y)\,dy = \int_{-\infty}^{+\infty} y\,dF(y);$$

if $t = F(y)$, and considering that $y = F^{-1}(F(y)) = F^{-1}(t)$, by
substitution we obtain

$$\mu = \int_0^1 F^{-1}(t)\,dt.$$

Likewise:

$$s^2(y) = \int_{-\infty}^{+\infty} y^2 f(y)\,dy - \mu^2 = \int_0^1 \left(F^{-1}(t)\right)^2 dt - \mu^2$$

by the same substitution adopted above.
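The two quantile-based expressions for the mean and the variance can be checked
numerically. The sketch below is illustrative only: the helper name
`moments_from_quantile`, the midpoint rule and the grid size are assumptions,
and the two test distributions (a uniform density on [2, 8] and an exponential
density with rate 2) are chosen because their quantile functions and moments are
known in closed form.

```python
import numpy as np

def moments_from_quantile(qf, m=100_000):
    """Approximate the mean and variance of a distribution from its quantile
    function qf, using the midpoint rule on a uniform grid of (0, 1)."""
    t = (np.arange(m) + 0.5) / m       # midpoints of the m sub-intervals
    q = qf(t)
    mu = q.mean()                      # approximates int_0^1 F^{-1}(t) dt
    var = (q ** 2).mean() - mu ** 2    # approximates int_0^1 (F^{-1}(t))^2 dt - mu^2
    return mu, var

# Uniform density on [2, 8]: F^{-1}(t) = 2 + 6 t, mean 5, variance 36 / 12 = 3.
print(moments_from_quantile(lambda t: 2 + 6 * t))

# Exponential density with rate 2: F^{-1}(t) = -ln(1 - t) / 2, mean 1/2, variance 1/4.
print(moments_from_quantile(lambda t: -np.log(1 - t) / 2))
```

The printed values approximate the closed-form moments given in the comments;
the exponential case converges more slowly because its quantile function is
unbounded as $t$ approaches 1.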

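Now let us center the two distributions on their means, so that
$z(i) = y(i) - \mu_i$ and $z(j) = y(j) - \mu_j$, with centered quantile functions
$F_i^{-1c}(t) = F_i^{-1}(t) - \mu_i$ and $F_j^{-1c}(t) = F_j^{-1}(t) - \mu_j$.
In [J| it is proven that

$$d_W^2(y(i), y(j)) = (\mu_i - \mu_j)^2 + d_W^2(z(i), z(j)) \tag{28}$$

where

$$d_W^2(z(i), z(j)) := \int_0^1 \left(F_i^{-1c}(t) - F_j^{-1c}(t)\right)^2 dt \tag{29}$$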


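Developing the square we obtain

$$
\begin{aligned}
d_W^2(z(i), z(j)) &= \int_0^1 \left(F_i^{-1c}(t)\right)^2 dt + \int_0^1 \left(F_j^{-1c}(t)\right)^2 dt - 2\int_0^1 F_i^{-1c}(t)\,F_j^{-1c}(t)\,dt \\
&= \int_0^1 \left(F_i^{-1}(t) - \mu_i\right)^2 dt + \int_0^1 \left(F_j^{-1}(t) - \mu_j\right)^2 dt - 2\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)\left(F_j^{-1}(t) - \mu_j\right) dt \\
&= s_i^2 + s_j^2 - 2\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)\left(F_j^{-1}(t) - \mu_j\right) dt
\end{aligned}
\tag{30}
$$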

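Let us consider the following quantity

$$
\begin{aligned}
\rho_{QQ} &= \frac{\int_0^1 F_i^{-1c}(t)\,F_j^{-1c}(t)\,dt}{\sqrt{\int_0^1 \left(F_i^{-1c}(t)\right)^2 dt \int_0^1 \left(F_j^{-1c}(t)\right)^2 dt}} \\
&= \frac{\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)\left(F_j^{-1}(t) - \mu_j\right) dt}{\sqrt{\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)^2 dt \int_0^1 \left(F_j^{-1}(t) - \mu_j\right)^2 dt}} \\
&= \frac{\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)\left(F_j^{-1}(t) - \mu_j\right) dt}{s_i s_j}
\end{aligned}
\tag{31}
$$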

It can be considered as the correlation of two series of data where each pair
of observations consists of the $t$-th quantile of the first distribution and
the $t$-th quantile of the second. In this sense we may consider it as the
correlation between the quantile functions, represented by the curve of the
infinitely many quantile points in a QQ plot. It is worth noting that
$0 < \rho_{QQ} \le 1$, unlike the classical range of variation $[-1, +1]$ of the
Bravais-Pearson correlation index. Equation (30) can be rewritten as



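$$d_W^2(z(i), z(j)) = s_i^2 + s_j^2 - 2\int_0^1 \left(F_i^{-1}(t) - \mu_i\right)\left(F_j^{-1}(t) - \mu_j\right) dt = s_i^2 + s_j^2 - 2\rho_{QQ}\, s_i s_j \tag{32}$$

Adding and subtracting $2 s_i s_j$ we obtain

$$d_W^2(z(i), z(j)) = s_i^2 + s_j^2 - 2 s_i s_j + 2 s_i s_j - 2\rho_{QQ}\, s_i s_j = (s_i - s_j)^2 + 2 s_i s_j (1 - \rho_{QQ}) \tag{33}$$

We may substitute this result into (28), obtaining:

$$d_W^2(y(i), y(j)) = (\mu_i - \mu_j)^2 + d_W^2(z(i), z(j)) = (\mu_i - \mu_j)^2 + (s_i - s_j)^2 + 2 s_i s_j (1 - \rho_{QQ})$$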

QED 
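The decomposition into location, size and shape components can also be verified
numerically. The following sketch makes several assumptions of its own: the
function name `decomposition` is hypothetical, quantile functions are sampled at
the midpoints of a uniform grid on (0, 1), and the two example distributions (a
uniform density on [2, 8] and an exponential density with rate 2) are arbitrary
choices. On such a grid the identity holds exactly up to floating-point error,
because it is an algebraic identity between sample moments.

```python
import numpy as np

def decomposition(qi, qj):
    """Location, size and shape components of the squared Wasserstein distance,
    computed from two quantile functions sampled on the same grid."""
    mu_i, mu_j = qi.mean(), qj.mean()
    s_i, s_j = qi.std(), qj.std()          # population standard deviations (ddof=0)
    rho = np.mean((qi - mu_i) * (qj - mu_j)) / (s_i * s_j)   # rho_QQ
    location = (mu_i - mu_j) ** 2
    size = (s_i - s_j) ** 2
    shape = 2 * s_i * s_j * (1 - rho)
    return location, size, shape, rho

m = 10_000
t = (np.arange(m) + 0.5) / m
qi = 2 + 6 * t               # quantile function of a uniform density on [2, 8]
qj = -np.log(1 - t) / 2      # quantile function of an exponential density, rate 2

d2 = np.mean((qi - qj) ** 2)                 # direct squared Wasserstein distance
location, size, shape, rho = decomposition(qi, qj)
print(d2, location + size + shape, rho)
```

The first two printed values should coincide, and `rho` lies in (0, 1] because
both quantile functions are non-decreasing.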
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